Recurrent neural networks

Per Unneberg

20 January, 2022

Overview

Recap

Sequential models

Recurrent neural networks (RNNs)

LSTMs and GRUs

Practical applications

Recap

Perceptron (single neuron)

Architecture

A single neuron has \(n\) inputs \(x_i\) and an output \(y\). To each input is associated a weight \(w_i\).

Activity rule

The activity rule is given by two steps:

\[a = \sum_{i} w_ix_i, \quad i=1,...,n\]

\[\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}\]

(MacKay, 2003)

Perceptron (single neuron)

Architecture

A single neuron has \(n\) inputs \(x_i\) and an output \(y\). To each input is associated a weight \(w_i\).

Activity rule

The activity rule is given by two steps:

\[a = \sum_{i} w_ix_i, \quad i=1,...,n\]

\[\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}\]

(MacKay, 2003)

Perceptron (single neuron)

\[a = w_0 + \sum_{i} w_ix_i, \quad i=1,...,m\]

\[y = y(a) = g\left( w_0 + \sum_{i=1}^{m} w_ix_i \right)\]

or in vector notation

\[y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)\]

where:

\[\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_m\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_m\end{bmatrix}\]

(Alexander Amini, 2021)

Simplified illustration and notation

Architecture

Vectorized versions: input \(\boldsymbol{x}\), weights \(\boldsymbol{w}\), output \(\boldsymbol{y}\)

Activity rule

\[a = \boldsymbol{wx}\]

Feed forward network

Simplified illustration

Sequential models

Motivation

Motivation

Motivation

Motivation

A few concrete examples of real-world sequences

Language translation

Time series

(Herzen et al., 2021)

Genomics

(Shen et al., 2018)

Types of models

one to one

many to one

one to many

many to many

Image classification

Sentiment analysis

Image captioning

Machine translation

(Karpathy, 2015)

Recurrent Neural Networks (RNNs)


Feed forward network implementation to sequential data

Assume multiple time points.

Feed forward network implementation to sequential data

Assume multiple time points.

Feed forward network implementation to sequential data

Assume multiple time points.

  • Dependency of inputs not modelled such that ambiguous sequences cannot be be distinguished:

“dog bites man” vs “man bites dog”

Feed forward network implementation to sequential data

Assume multiple time points.

  • Time points are modelled individually ( \(\hat{Y}_t = f(X_t)\) )

Feed forward network implementation to sequential data

Assume multiple time points.

  • Time points are modelled individually ( \(\hat{Y}_t = f(X_t)\) )
  • Also want dependency on previous inputs ( \(\hat{Y}_t = f(..., X_2, X_1)\) )

Adding recurrence relations

Adding recurrence relations

Adding recurrence relations

Adding recurrence relations

Folded representation

Unfolded representation

Add a hidden state \(h\) that introduces a dependency on the previous step:

\[ \hat{Y}_t = f(X_t, h_{t-1}) \]

Sequential memory of RNNs

RNNs have what one could call “sequential memory” (Phi, 2020)

Alphabet

Exercise: say alphabet in your head

    A B C ... X Y Z

Modification: start from e.g. letter F

May take time to get started, but from there on it’s easy

Now read the alphabet in reverse:

    Z Y X ... C B A

Memory access is associative and context-dependent

Recurrent Neural Networks


Add recurrence relation where current hidden cell state \(h_t\) depends on input \(x_t\) and previous hidden state \(h_{t-1}\) via a function \(f_W\) that defines the network parameters (weights):

\[ h_t = f_\mathbf{W}(x_t, h_{t-1}) \]

Note that the same function and weights are used across all time steps!

Recurrent Neural Networks - pseudocode


class RNN:
    def __init__(self):
        # Initialize weights and cell state
        self._h = [...]
        self._Whh = [...]
        self._Wxh = [...]
        self._Why = [...]

    def update_cell_state(self, x):
        # function is some function that updates cell state
        self._h = function(self._h * self._Whh + x * self.Wxh)
    
    def predict(self):
        return self._h * self._Why

    def update_weights(self, y):
        # Calculate error via some loss function
        error = loss(self.predict() - y)
        # update weights via back propagation...

rnn = RNN()

for x, y in input_data:
    rnn.update_cell_state(x)
    rnn.update_weights(y)

# Retrieve next prediction
yhat = rnn.predict() 

Vanilla RNNs


Output vector

\[ \hat{Y}_t = \mathbf{W_{hy}^T}h_t \]

Update hidden state

\[ h_t = \mathsf{tanh}(\mathbf{W_{xh}^T}X_t + \mathbf{W_{hh}^T}h_{t-1}) \]

Input vector

\[ X_t \]

Vanilla RNNs


(Olah, 2015)

Vanilla RNNs


Vanilla RNNs


Vanilla RNNs


Note: \(\mathbf{W_{xh}}\), \(\mathbf{W_{hh}}\), and \(\mathbf{W_{hy}}\) are shared across all cells!

Desired features of RNN

  1. Variable sequence lengths

Not all inputs are of equal length

  1. Long-term memory

“I grew up in England, and … I speak fluent English”

  1. Preserve order

“dog bites man” != “man bites dog”

  1. Share parameters

Adresses points 2 and 3.

Example: Box & Jenkins airline passenger data set

(Onnen, 2021)

Example: generate test and training data

Partition time series into training and test data sets at an e.g. 2:1 ratio:

import rnnutils
import numpy as np
df = rnnutils.airlines()
data = np.array(df['passengers'].values.astype('float32')).reshape(-1, 1)
train, test, scaler = rnnutils.make_train_test(data) 

Example: prepare data for keras

time_steps = 12
trainX, trainY, trainX_indices, trainY_indices = rnnutils.make_xy(train, time_steps)
testX, testY, testX_indices, testY_indices = rnnutils.make_xy(test, time_steps) 

Example: create vanilla RNN model

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN

model = Sequential()
model.add(SimpleRNN(units=3, input_shape=(time_steps, 1),
                    activation="tanh"))
model.add(Dense(units=1, activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary() 
## Model: "sequential"
## _________________________________________________________________
## Layer (type)                 Output Shape              Param #   
## =================================================================
## simple_rnn (SimpleRNN)       (None, 3)                 15        
## _________________________________________________________________
## dense (Dense)                (None, 1)                 4         
## =================================================================
## Total params: 19
## Trainable params: 19
## Non-trainable params: 0
## _________________________________________________________________

Example: fit the model and evaluate

history = model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)
Ytrainpred = model.predict(trainX)
Ytestpred = model.predict(testX) 
rnnutils.plot_history(history) 
data = {'train': (model.predict(trainX), train, trainY_indices),
        'test': (model.predict(testX), test, testY_indices)}
rnnutils.plot_pred(data, scaler=scaler, ticks=range(0, 144, 20), labels=df.year[range(0, 144, 20)]) 

Exercise

See if you can improve the airline passenger model. Some things to try:

  • change the number of units
  • change time_steps
  • change the number of epochs

Training

Recap: backpropagation algorithm in ffns

(Alexander Amini, 2021)

Recap: backpropagation algorithm in ffns

(Alexander Amini, 2021)


  1. perform forward pass and generate prediction

Recap: backpropagation algorithm in ffns

(Alexander Amini, 2021)


  1. perform forward pass and generate prediction
  2. calculate prediction error \(\epsilon_i\) wrt (known) output: \(\epsilon_i = \mathcal{L}(\hat{y}_i, y_i)\), loss function \(\mathcal{L}\)

Recap: backpropagation algorithm in ffns

(Alexander Amini, 2021)


  1. perform forward pass and generate prediction
  2. calculate prediction error \(\epsilon_i\) wrt (known) output: \(\epsilon_i = \mathcal{L}(\hat{y}_i, y_i)\), loss function \(\mathcal{L}\)
  3. back propagate errors and update weights to minimize loss

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Errors are propagated backwards in time from \(t=t\) to \(t=0\).

Problem: calculating gradient may depend on large powers of \(\mathbf{W_{hh}}^{\mathsf{T}}\) (e.g. \(\delta\mathcal{L} / \delta h_0 \sim f((\mathbf{W_{hh}}^{\mathsf{T}})^t)\)

The effect of vanishing gradients on long-term memory

In layer \(i\) gradient size ~ \((\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}\)

\(\downarrow\)

Weight adjustments depend on size of gradient

\(\downarrow\)

Early layers tend to “see” small gradients and do very little updating

\(\downarrow\)

Bias parameters to learn recent events

\(\downarrow\)

RNN suffer short term memory

(Olah, 2015)

“The clouds are in the _


“I grew up in England … I speak fluent _

Solutions to vanishing gradient

  1. Activation function

    ReLU (or leaky ReLU) instead of sigmoid or tanh

  1. Weight initialization

    Set bias=0, weights to identity matrix

  1. More complex cells using “gating”

    For example LSTM

LSTMs and GRUs

Motivation behind LSTMs and GRUs

LSTM

GRU

Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.

Intuition

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)

Intuition

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)


Remember the important parts, pay less attention to (forget) the rest.

LSTM: Cell state flow and gating

(Olah, 2015)

Information flows in the cell state from \(c_{t-1}\) to \(c_t\).

Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).

(Cho et al., 2014)

In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.

Forget, input, and output gates

forget gate

Purpose: reset content of cell

input gate

Purpose: decide when to read data into cell

output gate

Purpose: read entries from cell

Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\), where anything from no information (0) to all information (1) passes through the gate.

The forget gate

Purpose: decide what information to keep or throw away

Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\), where 0=forget, 1=keep.

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]

Add new information - the input gate

Two steps to adding new information:

  1. sigmoid layer decides which values to update

Add new information - get candidate values

Two steps to adding new information:

  1. sigmoid layer decides which values to update
  2. tanh layer creates vector of new candidate values \(\tilde{c}_t\)

\[ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c) \]

Updating the cell state

  1. multiply old cell state by \(f_t\) to forget what was decided to forget
  2. add \(i_t * \tilde{c}_t\)

\[ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \]

Cell output

Output is filtered version of cell state.

  1. sigmoid output gate decides what parts of cell state to output
  2. push cell state through tanh and multiply by sigmoid output

\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]

LSTM: putting it together

Intuition

  • if forget ~ 1, input ~ 0, \(c_{t-1}\) will be saved over time

LSTM: putting it together

(Zhang et al., 2021)

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]

\[ x_t \in \mathbb{R}^{n\times d}, h_{t-1} \in \mathbb{n \times h}, i_t \in \mathbb{R}^{n\times h}, f_t \in \mathbb{R}^{n\times h}, o_t \in \mathbb{R}^{n\times h}, \]

and

\[ W_f \in \mathbb{R}^{n \times (h+d)}, W_i \in \mathbb{R}^{n \times (h+d)}, W_o \in \mathbb{R}^{n \times (h+d)}, W_c \in \mathbb{R}^{n \times (h+d)} \]

GRU

  • forget and input states combined to single update gate
  • merge cell and hidden state
  • simpler model than LSTM

Exercise

1. Analyse airline passengers with LSTM

Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.

2. Time-series forecasting with LSTM, discrete state space

LSTM with Variable Length Input Sequences to One Character Output

Time-series forecasting with LSTM, discrete state space

Objective

Predict next character in sequence of strings

Comments

  • you may have to use several LSTM layers, in which the last layer should not return sequences (set return_sequences=False) (Brownlee, 2017)

Bibliography

Alexander Amini. (2021). MIT 6.S191: Recurrent Neural Networks. https://www.youtube.com/watch?v=qjrad0V0uJE
Brownlee, J. (2017). Stacked Long Short-Term Memory Networks. In Machine Learning Mastery. https://machinelearningmastery.com/stacked-long-short-term-memory-networks/
Cho, K., Merrienboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [Cs, Stat]. http://arxiv.org/abs/1406.1078
Herzen, J., Lässig, F., Piazzetta, S. G., Neuer, T., Tafti, L., Raille, G., Pottelbergh, T. V., Pasieka, M., Skrodzki, A., Huguenin, N., Dumonal, M., Kościsz, J., Bader, D., Gusset, F., Benheddi, M., Williamson, C., Kosinski, M., Petrik, M., & Grosch, G. (2021). Darts: User-friendly modern machine learning for time series. https://arxiv.org/abs/2110.03224
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. https://karpathy.github.io/2015/05/21/rnn-effectiveness/
MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms (Illustrated edition). Cambridge University Press.
Olah, C. (2015). Understanding LSTM Networks – colah’s blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Onnen, H. (2021). Temporal Loops: Intro to Recurrent Neural Networks for Time Series Forecasting in Python. In Medium. https://towardsdatascience.com/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f
Phi, M. (2020). Illustrated Guide to Recurrent Neural Networks. In Medium. https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9
Shen, Z., Bao, W., & Huang, D.-S. (2018). Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Scientific Reports, 8(1), 15270. https://doi.org/10.1038/s41598-018-33321-1
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into deep learning. arXiv Preprint arXiv:2106.11342.